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Abstract 

To scale Gaussian processes (GPs) to large data 
sets we introduce the robust Bayesian Com¬ 
mittee Machine (rBCM), a practical and scal¬ 
able product-of-experts model for large-scale 
distributed GP regression. Unlike state-of-the- 
art sparse GP approximations, the rBCM is con¬ 
ceptually simple and does not rely on inducing 
or variational parameters. The key idea is to 
recursively distribute computations to indepen¬ 
dent computational units and, subsequently, re¬ 
combine them to form an overall result. Efficient 
closed-form inference allows for straightforward 
parallelisation and distributed computations with 
a small memory footprint. The rBCM is in¬ 
dependent of the computational graph and can 
be used on heterogeneous computing infrastruc¬ 
tures, ranging from laptops to clusters. With suf¬ 
ficient computing resources our distributed GP 
model can handle arbitrarily large data sets. 


1. Introduction 

Gaussian processes (GPs) (Rasmussen & Williams, 2006) 
are the method of choice for probabilistic nonlinear re¬ 
gression; Their non-parametric nature allows for flexi¬ 
ble modelling without specifying low-level assumptions 
(e.g., the degree of a polynomial) in advance. Inference 
can be performed in a principled way simply by apply¬ 
ing Bayes’ theorem. GPs have had substantial impact in 
geostatistics (Cressie, 1993), optimisation (Jones et al., 
1998; Brochu et al., 2009), data visualisation (Lawrence, 
2005), robotics and reinforcement learning (Deisenroth 
et al., 2015), spatio-temporal modelling (Luttinen & Bin, 
2012), and active learning (Krause et al., 2008). A strength 
of GPs is that they are a fairly reliable black-box func- 
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tion approximator, i.e., they produce reasonable predictions 
without manual parameter tuning. However, their inher¬ 
ent weakness is that they scale poorly with the size N of 
the data set; Training and predicting scale in 0{N^) and 
respectively, which practically limits GPs to data 
sets of size 0(10^). 

For large data sets (e.g., N > 10"^) sparse approxima¬ 
tions are often used (Williams & Seeger, 2001; Quinonero- 
Candela & Rasmussen, 2005; Hensman et al., 2013; Tit- 
sias, 2009; Lazaro-Gredilla et al., 2010; Shen et al., 2006; 
Gal et al., 2014). Typically, they lower the computational 
burden by implicitly (or explicitly) using a subset of the 
data, which scales (sparse) GPs to training set sizes N G 
0(10®). Recently, Gal et al. (2014) proposed an approach 
that scales variational sparse GPs (Titsias, 2009) further by 
exploiting distributed computations. In particular, they de¬ 
rive an exact re-parameterisation of the variational sparse 
GP by Titsias (2009) to update the variational parameters 
independently on different computing nodes. However, 
even with sparse approximations it is inconceivable to ap¬ 
ply GPs to training set sizes of W > 0(10^). 

An alternative to sparse approximations is to distribute the 
computations by using independent local “expert” models, 
which operate on subsets of the data. These local mod¬ 
els typically require stationary kernels for a notion of “dis¬ 
tance” and “locality”. Shen et al. (2006) used KD-trees to 
recursively partition the data space into a multi-resolution 
tree data structure, which scale GPs to 0(10^) training 
points. However, no solutions for variance predictions are 
provided, and the approach is limited to stationary kernels. 
Along the lines of exploiting locality, mixture-of-experts 
(MoE) models (Jacobs et al., 1991) have been applied to 
GP regression (Rasmussen & Ghahramani, 2002; Meeds 
& Osindero, 2006; Yuan & Neubauer, 2009). However, 
these models have not primarily been used to speed up 
GP regression, but rather to increase the expressiveness of 
the model, i.e., allowing for heteroscedasticity and non- 
stationarity. Each local model possesses its own set of 
hyper-parameters to be optimised. Predictions are made 
by collecting the predictions of all local expert models, and 
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weighting them using the responsibilities assigned by the 
gating network. Closed-form inference in these models is 
intractable, and approximations typically require MCMC. 
Nguyen & Bonilla (2014) sidestep MCMC inference and 
speed the GP-MoE model up by (i) fixing the number of 
GP experts, (ii) combining it with the pseudo-input sparse 
approximation by Snelson & Ghahramani (2006). This ap¬ 
proach assigns data points to expert probabilistically using 
proximity information provided by stationary kernels and 
scales GPs to 0(10®) data points. 

Product-of-GP-experts models (PoEs) sidestep the weight 
assignment problem of mixture models: Since PoEs multi¬ 
ply predictions made by independent GP experts, the over¬ 
all prediction naturally weights the contribution of each ex¬ 
pert. However, the model tends to be overconfident (Ng 
& Deisenroth, 2014). Cao and Eleet (Cao & Eleet, 2014) 
recently proposed a generalised PoE-GP model in which 
the contribution of an expert in the overall prediction can 
weighted individually. This model is often too conser¬ 
vative, i.e., it over-estimates variances. Tresp’s Bayesian 
Committee Machine (BCM) (Tresp, 2000) can be consid¬ 
ered a PoE-GP model, which provides a consistent frame¬ 
work for combining independent estimators within the 
Bayesian framework, but it suffers from weak experts. 

In this paper, we exploit the fact that the computations of 
PoE models can be distributed amongst individual com¬ 
puting units and propose the robust BCM (rBMC), a new 
family of hierarchical PoE-GP models that (i) includes the 
BCM (Tresp, 2000) and to some degree the generalised 
PoE-GP (Cao & Eleet, 2014) as special cases, (ii) provides 
consistent approximations of a full GP, (iii) scales to ar¬ 
bitrarily large data sets by parallelisation. Unlike sparse 
GPs our rBCM operates on the full data set but distributes 
the computational and memory load amongst a large set of 
independent computational units. The rBCM recursively 
recombines these independent computations to form an ef¬ 
ficient distributed GP inference/training framework. 

A key advantage of the rBCM is that all computations can 
be performed analytically, i.e., no sampling is required. 
With sufficient computing power our model can handle ar¬ 
bitrarily large data sets. We demonstrate that the rBCM 
can be applied to data sets of size 0(10^), which exceeds 
the typical data set sizes sparse GPs deal with by orders 
of magnitude. However, even with limited resources, our 
model is practical: A GP with a million data points can be 
trained in less than half an hour on a laptop. 

2. Problem Set-up and Objective 

We consider a regression problem y = f{x) -f e S IR, 
where x G R^. The Gaussian likelihood p(yjf(x)) = 
N{f{x)^al) accounts for the i.i.d. measurement noise 


e ^ A/^(0, (Tg). The objective is to infer the latent function 
/ from a training data set X = {xi]f^^^y = {yi]f^i. 
Eor small data set sizes N, a Gaussian process (GP) is a 
method of choice for probabilistic non-parametric regres¬ 
sion. A GP is defined as a collection of random variables, 
any finite number of which is Gaussian distributed. A GP 
is fully specified by a mean function m and a covariance 
function k (kernel) with hyper-parameters 'i/j. Without loss 
of generality, we assume that the prior mean function is 0. 

A GP is typically trained by finding hyper-parameters 6 = 
that maximise the log-marginal likelihood 

\ogp{y\X,9) 

= -\{y^{K + alir^y + log \K + all\) , (1) 

where K = k{X, X) G is the kernel matrix. 

Eor a given set of hyper-parameters 0, a training set X,y 
and a test input x* G IR^, the GP posterior predictive dis¬ 
tribution of the corresponding function value /* = /(x») 
is Gaussian with mean and variance given by 

E[f,]=m(x,) = k^(K + a^I)-^y, (2) 

var[/*] = cr^(x*) = kt* - kJ'(K + (rllY^k^ , (3) 

respectively, where fc* = k{X, x*) and fc** = /c(x», x*). 

Training requires the inversion and the determinant of 
K + Yl in (1), both of which scale in 0{N^) with 
a standard implementation. Eor predictions, we cache 
{K -f such that the mean and variance in (2) 

and (3) require 0{N) and 0{N'^) computations, respec¬ 
tively. Eor N > 10,000 training and predicting be¬ 
come time-consuming procedures, which additionally re¬ 
quire 0{N‘^ + ND) memory. 

Throughout this paper, we assume that a standard GP is a 
good model for the latent function /. However, due to the 
data set size N the full GP is not applicable. 

To scale GPs to large data sets with N ^ 0(10^), we ad¬ 
dress both the computational and the memory issues of full 
GPs by distributing the computational and memory loads 
to many independent computational units that only operate 
on subsets of the data. Eor this purpose, we devise a robust 
and scalable hierarchical product-of-GP-experts model. 

3. Distributed Product-of-GP-Experts Models 

Product-of-experts models (PoEs) are generally promising 
for parallelisation and distributed computing. In a PoE 
model, an overall computation is the product of many inde¬ 
pendent (smaller) computations, performed by “experts”. 
In our case, every expert is a GP that accesses only a subset 
of the training data. In this paper, we consider a GP with 
a training data set 22 = {X, y}. We partition the training 
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(a) 1-layer model. 


a 



(b) 2-layer model. 





(c) 3-layer model. 


Figure 1. Computational graphs of hierarchical PoE models. Main computations are at the leaf nodes (GP experts, hlack). All other 
nodes recombine computations from their direct children. The top node (blue) computes the overall prediction. 


data into M sets k = 1,..., M, and 

use a GP on each of them as a (local) expert'. Each GP 
expert performs computations (e.g., mean/variance predic¬ 
tions) conditioned on their respective training data 
These (local) predictions are recombined by a parent node 
(see Fig. 1(a)), which subsequently may play the role of an 
expert at the next level of the model architecture. Recur¬ 
sive application of these recombinations results in a multi¬ 
layered tree-structured computational graph, see Fig. 1(c). 

The assumption of independent GP experts leads to a 
block-diagonal approximation of the kernel matrix, which 
(i) allows for efficient training and predicting (ii) can be 
computed efficiently (time and memory) by parallelisation. 

3.1. Training 

Due to the independence assumption, the marginal likeli¬ 
hood p{y\X, 9) in a PoE model factorises into the product 
of M individual terms, such that 

(4) 

where each factor pk is determined by the fcth GP expert. 
For training the PoE model, we seek GP hyper-parameters 
6 that maximise the corresponding log-marginal likelihood 

logp(y|X,e) (5) 

where M is the number of GP experts. The terms in (5) are 
independently computed and given by 

logp(y('=)|xW,0) = ^ 

- i log + cr^I| + const, 

where is an Uk x rik matrix, and 

rifc iV is the size of the data set associated with the fcth 
GP expert. Since we assume that a standard GP is sufficient 
to model the latent function, all GP experts at the leaves 
of the tree-structured model are trained jointly and share a 
single set of hyper-parameters 9 = {r/^jCTe}. Computing 
the log-marginal likelihood terms in (6) requires the inver¬ 
sion and determinant of + a^I. These computations 

'The notion of “locality” is misleading as our model does not 
require similarity measures induced by stationary kernels. 


require 0{n\) time with a standard implementation. The 
memory consumption is + UkD) for each GP expert. 

In (5), the number of parameters 9 to be optimised is rela¬ 
tively small since we do not consider additional variational 
parameters or inducing inputs that we optimise. Training 
(i) allows for straightforward parallelisation, (ii) provides 
a significant speed-up compared to training a full GP, (iii) 
requires solving low-dimensional ©(Z?) optimisation prob¬ 
lem unlike sparse GPs, which additionally optimise induc¬ 
ing inputs or variational parameters. 

3.2. Predictions 

In the following, we assume that a set of M GP experts has 
been trained according to Section 3.1 and detail how the 
PoE (Ng & Deisenroth, 2014), the generalised PoE (Cao & 
Fleet, 2014) and the Bayesian Committee Machine (Tresp, 
2000) combine predictions of the GP experts to form an 
overall prediction. Furthermore, we highlight strengths and 
weaknesses of these models, which motivates our robust 
Bayesian Committee Machine (rBCM). The rBCM unifies 
many other models while providing additional flexibility, 
which can address the shortcomings of the PoE, gPoE and 
the BCM. For illustration purposes, we focus on the model 
in Fig. 1(a), but many models generalise to an arbitrarily 
deep computational graph (see Section 4). 

3.2.1. Product of GP Experts 

The product-of-GP-experts model predicts a function value 
/* at a corresponding test input according to 

p{f4x^,V) = , (7) 

where M GP experts operate on different training data sub¬ 
sets The M GP experts predict means and 

variances cr^(a;*), /c = 1,..., M, independently. The joint 
prediction p(/» |a:*, V) is obtained by the product of all ex¬ 
perts’ predictions. The product of these Gaussian predic¬ 
tions is proportional to a Gaussian with mean and precision 

, ( 8 ) 

= ( 9 ) 
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(a) Product of GP Experts. 


(b) Generalised Product of GP (c) Bayesian Committee Ma- (d) Robust Bayesian Committee 
Experts. chine. Machine. 


Figure 2. Predictions with four different product-of-experts models. The full GP to be approximated is shown by the shaded area, 
representing 95% of the GP confidence intervals. Training data is shown as black circles. Each GP expert was assigned two data points, 
(a): The standard PoE model does not fall back to the prior when leaving the training data, (b); The generalised PoE model falls back 
to the prior, but over-estimates the variances in the regime of the training data, (c): The BCM is close to the full GP in the range of 
the training data, but the predictive mean can suffer, (d): The rBCM is more robust to weak experts than the (g)PoE and the BCM and 
produces reasonable predictions. 


respectively. For k = 1, this model is identical to the full 
GP we wish to approximate. 

Strengths of the PoE model are (a) the overall prediction 
p(/*|a;*,I9) is straightforward to compute, (b) there are 
no free weight parameters to be assigned to each predic¬ 
tion (unlike in MoE models). A shortcoming of this model 
is that with an increasing number of GP experts the pre¬ 
dictive variances vanish (the precisions add up, see (9)), 
which leads to overconfident predictions, especially in re¬ 
gions without data. Thus, the PoE model is inconsistent in 
the sense that it does not fall back to the prior, see Eig. 2(a). 

3.2.2. Generalised Product of GP Experts 

The generalised product-of-experts model (gPoE) by Cao 
& Eleet (2014) adds the flexibility of increasing/reducing 
the importance of experts. The predictive distribution is 

_^ jVff 

p(/*|a:*,X>) = , (10) 

where the /3k weight the contributions of the experts. The 
predictive mean and precision are, therefore, 

^gpoe ^ (^gpoe)2 (at* (tT, ) , (11) 

(^gpoe)-2 ^ /3fccr-2(a:*), (12) 

respectively. A strength of the gPoE is that with Pk = 
1 the model falls back to the prior outside the range of the 
data (and corresponds to a log-opinion-pool model (Hes- 
kes, 1998)). A weakness of the gPoE is that in the range of 
the data, it over-estimates the variances, i.e., the predictions 
are generally too conservative, especially with an increas¬ 
ing number of GP experts. Pig. 2(b) illustrates these two 
properties. 

Cao & Eleet (2014) suggest to set j3k to the difference in the 
differential entropy between the prior and the posterior to 


determine the importance. In this paper, we do not consider 
this setting for the gPoE for two reasons: (i) /3fc 1 
leads to unreasonable error bars; (ii) Even with a normal¬ 
isation, this setting does not allow for deep computational 
graphs. In fact, it only allows for a single-layer computa¬ 
tional graph, such as Pig. 1(a). To allow for a general com¬ 
putational graph, we require an a-priori setting of the f/k- 
Thus, we set I3k = 1/M, where M is the number of GP 
experts. In this case, the predicted means in (8) and (11) 
are identical, but the precisions differ, see also Pig. 2(a) 
and 2(b) for a comparison. 

3.2.3. Bayesian Committee Machine 

A third model that falls in the category of PoE models is the 
Bayesian Committee Machine (BCM) proposed by Tresp 
(2000). Unlike the (g)PoE, the BCM explicitly incorpo¬ 
rates the GP prior p{f) when combining predictions (and 
not only at the leaves). 

Por two experts j, k and corresponding training data sets 
2 ?(i) ^ the predictive distribution is generally given by 

(13) 

where p(/*) is the GP prior over functions. The 
BCM makes the conditional independence assumption that 
X)(i) _[j_ 29 (fc) 1/^. With (13) this yields 

Pih) ^ ’ 

P(/*) 

which is the PoE model in (7) divided by the GP prior. Por 
M training data sets k = 1,..., M, the BCM applies 
the above approximation repeatedly, leading to the BCM’s 
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posterior predictive distribution 






(17) 


The (M — l)-fold division by the prior is the decisive dif¬ 
ference between the BCM and the PoE model in (7) and 
leads to the BCM’s predictive mean and precision 

^bcm ^ a-\xMx.) (18) 

( aS -)-2 ^ ^ _ ^)^-2 ^ ( 19 ) 

respectively, where is the prior precision of p{f*). 

The repeated application of Bayes’s theorem leads to an 
(M — l)-fold division by the prior in (17), which plays the 
role of a “correction” term in (19) that ensures a consistent 
model that falls back to the prior.^ The error bars of the 
BCM within the range of the data are usually good, but it 
is possible to “break” the BCM when only few data points 
are assigned to each GP expert. In Fig. 2(c), we see that the 
posterior mean suffers from weak experts when leaving the 
data (around x = 0).^ 


3.2.4. Robust Bayesian Committee Machine 

In this section, we propose the robust Bayesian Commit¬ 
tee Machine (rBCM), a unified model that (a) includes the 
gPoE and BCM as special cases, (b) yields consistent pre¬ 
dictions, (c) can be implemented on a distributed comput¬ 
ing architecture. 

Inspired by the gPoE in (10), the rBCM is a BCM with the 
added flexibility of increasing/decreasing an expert’s im¬ 
portance. The rBCM’s predictive distribution is 


predictive variance falls back to the prior when leaving the 
data. Note that we no longer require /^fc = 1 to en¬ 
sure this (see also (Bordley, 1982)), which will facilitate 
computational graphs with multiple layers. The gPoE from 
Section 3.2.2 and the BCM from Section 3.2.3 are recov¬ 
ered for Pk = 1/M and /3fc = 1, respectively. For Pk = 0, 
the rBCM is identical to the GP prior, and for Pi^ = 1 but 
without the correction, the rBCM recovers the PoE from 
Section 3.2.1 

The parameters Pk control not only the importance of the 
individual experts, but they also control how strong the 
influence of the prior is. Assuming each GP expert is a 
good predictive model, we would set Pk = 1 for all k, 
such that we retain the BCM. If the quality of the GP 
experts is weak, e.g., data is noisy and the experts’ data 
sets are small, Pk allows us to weaken the experts’ 
votes and to robustify the predictive model by putting (rel¬ 
atively) more weight on the prior. Therefore, we follow the 
suggestion by Cao & Fleet (2014) and choose Pk accord¬ 
ing to the predictive power of each expert at tc*. Specif¬ 
ically, we use the difference in differential entropy be¬ 
tween the prior p(/* jx*) and the posterior (/* | x*, ). 

This quantity can be computed efficiently and is given as 
Pk = 5 (logo-** - logcr^(x*)), where is the prior vari¬ 
ance and (x*) is the predictive variance of the /cth expert. 

Fig. 2(d) shows that for this choice of Pk the rBCM ex¬ 
presses more uncertainty about the learned model than the 
BCM; Due to the adaptive influence of the prior in (21)- 
(22), the variances within the range of the data (black cir¬ 
cles) are slightly larger, but the predictive mean no longer 
suffers from the dominant “kink” at around x = 0 com¬ 
pared to the BCM in Fig. 2(c). Overall, the rBCM provides 
more reasonable predictions than any other model in Fig. 2. 


p(/*|x*,X>) 




( 20 ) 


where the predictive mean and precision are given as 

/if” = (af ”)^ Pkap\xMx*) (21) 

(^rbc™)-2 ^ Pkap\x,) + (1 - , 

( 22 ) 


respectively. The derivation of the rBCM in (20) is analo¬ 
gous to the BCM’s derivation in (14)-(16). 

The rBCM combines the flexibility of the generalised 
PoE with the appropriate Bayesian treatment of the BCM, 
which leads to the correction term (1 — Pk)<^P^ in the 

the precision in (22). This correction term ensures that the 

^The BCM predictions fall into the category of normalised 
group odds (Bordley, 1982). 

^Here, each expert was assigned only two data points. 


4. Distributed Computations 


In the following, we show that for a 
given number M of GP experts, the 
rBCM can be implemented in differ¬ 
ent computational graphs while pro¬ 
viding identical predictions. For in¬ 
stance, with 32 experts, we show that a 
single-layer computational graph with 
32 experts and one central node, see 
Fig. 1(a), is equivalent to a two-layer 
computational graph with 32 experts, 
8 parent nodes (each of which is re¬ 
sponsible for 4 GP experts), and one 
central node. This property can be ex- 



Figure 3. General 
architecture for 
the rBCM with 
arbitrarily many 


layers. 

ploited in distributed systems or to balance the communi¬ 
cation between computing units. 


In a single-layer model as shown in Fig. 1(a) the rBCM 
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predictions in (20) can be constructed by a gPoE (numer¬ 
ator) combining predictions of GP experts, followed by a 
correction via the prior (denominator). Let us consider 
a two-layer model with M GP experts, L nodes at level 
1 and one central node at level 2, see Fig. 1(b). Each 
grey node at level 1 is responsible for L}^ GP experts 
(black nodes). All M GP experts ki, compute weighted 
predictive distributions p^'^' (/* These predictions 

are then multiplied by the corresponding L parent nodes 
(grey) to pf*" (/* where /3fe = Y^iPki is the over¬ 

all weight of the subtree following the fcth node and 
The overall prediction at the top node 

(blue in Fig. 1) is 

nf^i pt ) n11 uti pf:- ) ,,,, 

where we accounted for the Pk — l)-fold correction 
by the prior. This computation can be obtained by a gPoE 
(black), followed by a PoE (red) and a prior correction 
(blue) in (23). Hence, for a given number of GP experts, 
the rBCM predictions can be equivalently realised in a sin¬ 
gle and two-layer computational graph, see Fig. 1. 

This can be generalised further to an arbitrarily deep com¬ 
putational graph, whose general implementation structure 
is shown in Fig. 3. The GP experts at the leaves compute 
their individual means, variances and confidence values 
Pi- The next layer consists of gPoE models, which com¬ 
pute the weighted means and variances according to (8) 
and (12), respectively (plus their overall weights /3fc = 
SiGchiidren/^fc.’ which are passed on to the next-higher 
level). The gPoE is followed by an arbitrary number of 
PoE models, which compute means and precisions accord¬ 
ing to (8) and (9), respectively (plus their overall weights 
Pk = EjGchiidren/^fcj’ ^hich are passed on to the next- 
higher level). The top layer accounts for the prior (blue 
term in (23)), which uses all the pk from the subtrees start¬ 
ing at its children to compute the overall mean and preci¬ 
sion according to (21) and (22), respectively. 

Hence, for a given number of GP experts, there are many 
equivalent computational graphs for the rBCM (and the 
other distributed GPs discussed in Section 3). For instance, 
all computational graphs in Fig. 1 return identical predic¬ 
tive distributions. This allows us to choose the compu¬ 
tational graph that works best with the computing infras¬ 
tructure available: Shallow graphs minimise overall traf¬ 
fic. However, they are more vulnerable to communication 
bottlenecks at the central node since it has a large number 
of connections. Deeper computational graphs cause more 
overall communication, but the tree has a smaller branch¬ 
ing factor. In practice, for a set of computational graphs 
we took the time it requires to compute the gradient of the 
marginal likelihood and chose the “fastest” architecture. 



Figure 4. Computation time for the log-marginal likelihood and 
its gradient with respect to the kernel hyper-parameters as a func¬ 
tion of the size of the training data. The distributed GPs (DGP), 
i.e., (g)PoE and (r)BCM, scale favourably to large-scale data sets. 

5. Experiments 

We empirically assess three aspects of all distribute GP 
models: (1) The required training time, (2) the approxima¬ 
tion quality, (3) a comparison with state-of-the-art sparse 
GP methods. In all experiments, we chose the standard 
squared exponential kernel with automatic relevance de¬ 
termination and a Gaussian likelihood. Moreover, we as¬ 
signed training data to experts randomly for two reasons: 
First, we demonstrate that our models do not need locality 
information; second, random assignment is very fast com¬ 
pared to clustering methods, e.g., KD-trees. 

5.1. Training Time for Large Data Sets 

To evaluate the training time for distributed GPs (DGP)"^, 
we measured the time required to compute the log-marginal 
likelihood and its gradient with respect to the kernel hyper¬ 
parameters. Since the model is trained using LBFGS, the 
overall training time is proportional to the time it takes to 
compute the log-marginal likelihood and its gradient. For 
this evaluation, we chose a computer architecture of 64 
nodes with four cores each. Furthermore, we chose a three- 
layer computational graph with varying branching factors. 
For data sets of < data points each GP expert possessed 
512 data points, for data set sizes of > 2^°, we chose the 
number of data points per node to be 128. 

Fig. 4 shows the time required for computing the,log- 
marginal likelihood and its gradient with respect to the 
hyper-parameters. The horizontal axis shows the size of the 
training set (logarithmic scale), the left vertical axis shows 
the computation time in seconds (logarithmic scale) for the 
DGP (blue-dashed), a full GP (red-dashed) and a sparse 
GP (FITC) with inducing inputs (Snelson & Ghahramani, 
2006) (green-dashed). For the sparse GP model, we chose 

"'This comprises the (g)PoE and (r)BCM for which the training 
time is identical. 
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the number M of inducing inputs to be 10% of the size 
of the training set, i.e., the computation time is of the or¬ 
der of 0{NM'^) = O(A^^/100), which offsets the curve 
of the full GP. The right vertical axis shows the number 
of GP experts (black-solid) amongst which we distribute 
the computation. While the training time of the full GP 
becomes impractical at data set sizes of about 20,000, the 
sparse GP model can be reasonably trained up to 50,000 
data points.^ The computational time required for the DGP 
to compute the marginal likelihood and gradients is signifi¬ 
cantly lower than that of the full GP, and we scaled it up to 
2^^ « 1.7 X 10^ training data points, which required about 
the same amount of time (« 230 s) for training a full GP 
with 2^^ « 1.6 X 10^ and a sparse GP with 2^^ « 3.2 x 10^ 
data points. The figure shows that for any problem size we 
can find a computational graph that allows us to train the 
model within a reasonable amount of time. 

Even if a big computing infrastructure is not available our 
model is useful in practice: We performed a full training 
cycle of the DGP with 10® data points on a standard laptop 
in about 20 minutes. This is also a clear indicator that the 
memory consumption of the DGP is relatively small. 

5.2. Empirical Approximation Errors 

We evaluate the predictive quality of the DGP models in¬ 
troduced in Section 3 on the Kin40K data set. The data set 
represents the forward dynamics of an 8-link all-revolute 
robot arm. The goal is to predict the distance of the end- 
effector from a target, given the joint angles of the eight 
links as features. The Kin40K data set consists of 10,000 
training points and 30,000 test points. We use the same split 
into training and test data as Seeger et al. (2003), Lazaro- 
Gredilla et al. (2010), and Nguyen & Bonilla (2014). 

We considered two baselines: a full (ground truth) GP and 
the subset-of-data (SOD) approximation, which uses a ran¬ 
dom subset of the full training data set to train a sparse GP. 
Taking training time into account, Chalupka et al. (2013) 
identified the SOD method as a good and robust sparse 
GP approximation. All approximate models (PoE, gPoE, 
BCM, rBCM, SOD) used the hyper-parameters from the 
full GP to eliminate issues with local optima. Eor ev¬ 
ery model, we took the time for computing the gradient 
of the marginal likelihood (training is proportional to this 
amount of time). We selected the number of data points 
for SOD, such that the gradient computation time approxi¬ 
mately matches the one of the DGPs. 

Experiments were repeated 10 times to average out the ef¬ 
fect of the random assignment of data points to experts and 
the selection of the subset of training data for the SOD ap- 

^We did not include any computational overhead for learning 
the inducing inputs, which is often time consuming. 




(b) NLPD. 


Figure 5. Performance as a function of training time (lower hori¬ 
zontal axes) and the number of data points per expert (upper hor¬ 
izontal axes), (a) RMSE, (b) NLPD. Standard errors (not dis¬ 
played) are smaller than 10“^. 


proximation. Eor this experiment, we used a Virtual Ma¬ 
chine with 16 3 GHz cores and 8 GB RAM. 

Eig. 5 shows the average performance (RMSE and nega¬ 
tive log predictive density (NLPD) per data point) of all 
models as a function of (a) the time to compute the gra¬ 
dient of the marginal likelihood and (b) the number of 
training data points per GP expert. The full GP (black, 
dashed) shows the ground-truth performance, but requires 
1162 seconds for the gradient computation. The perfor¬ 
mances of the (r)BCM and the (g)POE have been eval¬ 
uated using 256, 64, 16, 4, and 1 expert with 39, 156, 
625, 2500 and 10000 data points per expert, respectively. 
In Pig. 5(a), the rBCM consistently outperforms all other 
methods, where SOD is substantially worse than all DGPs. 
The NLPD in Pig. 5(b) allows us to make some more con¬ 
clusive statements: While the rBCM again outperforms all 
other methods, the BCM and the PoE’s performances suf¬ 
fer when only a small number of data points is assigned 
to the GP experts. The PoE suffers from variance under¬ 
estimation (see Pig. 2(a)) whereas the BCM suffers from 
“weak” experts (see Pig. 2(c)). SOD does not work well, 
even with 2500 data points. Overall, the rBCM provides an 
enormous training speed-up compared to a full GP, with a 
significantly better predictive performance than SOD. 

5.3. Airline Delays (US Elight Data) 

We assess the performance of the distributed GPs (PoE, 
gPoE, BCM and rBCM) on a large-scale non-stationary 
data set reporting flight arrival and departure times for 
every commercial fight in the US from January to April 
2008®. This data set contains information about almost 6 
million flights. We followed the procedure described by 
Hensman et al. (2013)^ to predict the flight delay (in min¬ 
utes) at arrival: We selected the first P data points to train 
the model and the following 100,000 to test it. We chose the 
same eight input variables x as Hensman et al. (2013): age 
of the aircraft, distance that needs to be covered, airtime, 

®http : //Stat-computing . org/dataexpo/200 9/ 

^Thanks to J Hensman for the pre-processing script. 
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Table 1. US Flight Data Set. SVIGP and Dist-VGP results are 
reported from the respective papers. Results with are the best 
solution found. Good and bad performances are highlighted in 
blue and red, respectively. 



700K/100K 

2M/100K 

5M/100K 


RMSE 

NLPD 

RMSE 

NLPD 

RMSE 

NLPD 

SVIGP 

33.0 

— 

— 

— 

— 

— 

Dist-VGP+ 

33.0 

— 

35.3 

— 

— 

— 

rBCM 

27.1 

9.1 

34.4 

8.4 

35.5 

8.8 

BCM 

33.5 

14.7 

55.8 

17.0 

55.3 

19.5 

gPoE 

28.7 

8.1 

35.5 

8.6 

37.3 

8.7 

PoE 

28.7 

14.1 

35.5 

26.3 

37.3 

17.7 

SOD 

> 10 ^ 

22.6 

> 10 ^ 

17.2 

> lO '^ 

16.4 


departure and arrival times, day of the week and month, 
month. This data set has been evaluated by Hensman et al. 
(2013) and Gal et al. (2014), who use either Stochastic 
Variational Inference GP (SVIGP) or Distributed Varia¬ 
tional GPs (Dist-VGP) to deal with this training set size. 

Using a workstation with 12 3.5 GHz cores and 32 GB 
RAM, we conducted 10 experiments with P — 7 x 10®, 
P = 2x 10® and P = 5 x 10® where we chose 4096, 8192 
and 32768 experts corresponding to 170, 244 and 152 train¬ 
ing data points per expert, respectively. The computation of 
the marginal likelihoods required 13, 39, and 90 seconds, 
respectively.^ The computational graphs were (16-16-16), 
(8192), and (32-32-32), respectively, where each number 
denotes the branching factor at the corresponding level in 
the tree. After normalising the data every single experiment 
consisted of an independent full training/test cycle, with 
random assignments of data points to GP experts. Training 
the DGPs normally required 30-100 line searches to con¬ 
verge. Table 1 reports the performance (RMSE and NLPD) 
of various large-scale GP methods for the flight data set. 
The results for SVIGP and Dist-VGP are taken from Hens¬ 
man et al. (2013) and Gal et al. (2014), respectively. Since 
SVIGP and Dist-VGP are difficult to optimise due to the 
large number of variational parameters. Gal et al. (2014) 
report only their best results, whereas we report an average 
of all experiments conducted. 

The standard errors (not shown in Table 1) of the rBCM and 
gPoE are consistently below 0.3, whereas the BCM and the 
PoE suffered from a few outliers, which is also indicated by 
the relatively large NLPD values. Compared to the Dist- 
VGP and SVIGP on the 700K data set, the rBCM, gPoE 
and PoE perform significantly better in RMSE. The table 
highlights the weaknesses of the PoE (under-estimation of 
the variance) and the BCM (problems with weak experts) 
very clearly. The property of the gPoE (too conservative) 
is a bit hidden: Although the RMSE of the gPoE is consis¬ 
tently worse than that of the rBCM, its NLPD tends to be a 
bit lower. The NLPD values of the rBCM and the gPoE are 

*A11 experiments can be conducted on a MacBook Air (2012) 
with 8 GB RAM. 


fairly consistent across all three experiments. 

The data set exhibits the property that the 700K/100K data 
set is more stationary than the 2M/100K and 5M/100K data 
sets. Therefore, we observe a decreasing performance al¬ 
though we include more training data. This effect has al¬ 
ready been reported by Gal et al. (2014). 

6. Discussion and Conclusion 

The distributed GP models discussed in this paper exhibit 
the appealing property that they are conceptually simple: 
They split the data set into small pieces, train GP experts 
jointly, and subsequently combine individual predictions 
to an overall computation. Compared to sparse GP mod¬ 
els, which are based on inducing or variational parame¬ 
ters, the distributed GPs possess only the normal GP hyper¬ 
parameters, i.e., it is less likely to end up in local optima. 

In this paper, all experts share the same hyper-parameters, 
which leads to automatic regularisation: The overall gradi¬ 
ent is an average of the experts’ marginal likelihood gradi¬ 
ents, i.e., overfitting of individual experts is not favoured. 

We believe that distributed computations are promising to 
scale GPs to truly large data sets. Sparse methods using in¬ 
ducing inputs are naturally limited by the number of induc¬ 
ing variables. In practice, we see 0(10^) many inducing 
variables (optimisation is hard), although their theoretical 
limit might be at 0(10"^), if we assume that this is the limit 
up to which we can train a full GP. If we assume that a sin¬ 
gle inducing variable summarises 100 data points, the prac¬ 
tical limit of sparse methods are data sets of size 0(10®). 
This can be extended by either a higher compression rate 
or parallelisation (Hensman et al., 2013; Gal et al., 2014). 

We introduced the robust Bayesian Committee Machine 
(rBCM), a conceptually straightforward, but effective, 
product-of-GP-experts model that scales GPs to (in princi¬ 
ple) arbitrarily large data sets. The rBCM distributes com¬ 
putations amongst independent computing units, allowing 
for straightforward parallelisation. Recursive and closed- 
form recombinations of these independent computations 
yield a practical model that is both computationally and 
memory efficient. The rBCM addresses shortcomings of 
other distributed models by appropriately incorporating the 
GP prior when combining predictions. The rBCM is inde¬ 
pendent of the computational graph and can be equivalently 
used on heterogeneous computing infrastructures, ranging 
from laptops to large clusters. Training an rBCM with a 
million data points takes less than 30 minutes on a laptop. 
With more computing power training the rBCM with 10^ 
data points can be done in a few hours. Compared to state- 
of-the-art sparse GPs, our model is conceptually simple, 
performs well, learns fast, requires little memory and does 
not require high-dimensional optimisation. 
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